I wish you a great semester!

Some Announcements

Hint: It is very important to have a GitHub account if you would like to be a data scientist or apply for a data analyst/scientist job.

Wednesday: 11.40-12.30

Friday: 14.40-15.30

Data Types

Data types are an important concept in statistics. You need to understand them to apply statistical measures to your data correctly and to draw valid conclusions from it. Understanding data types is also the starting point of exploratory data analysis, which is one of the most important parts of a data analysis project.

Categorical Data: Categorical variables take labels rather than quantities as values, such as gender or language. They are sometimes coded as numbers (0: male, 1: female), but these numbers don’t have any mathematical meaning.

Nominal Data: Nominal values represent discrete units and are used to label variables that have no quantitative value. Just think of them as labels. Note that nominal data has no order. For example,

Gender: Male, Female

Ordinal Data: Ordinal values represent discrete and ordered units. As you would guess from its name, the order matters. For example,

Education level: Primary, Secondary, Tertiary

Numerical Data

Discrete Data: We speak of discrete data if its values are distinct and separate, in other words, if the data can only take on certain values. This type of data can’t be measured, but it can be counted. For example,

Number of defective items in a box.

Continuous Data: Continuous data represent measurements, so their values can’t be counted, but they can be measured. For example,

Height of a person.

Interval Data: Interval values represent ordered units that have the same difference. We speak of interval data when a variable contains numeric values that are ordered and the exact differences between the values are known. The problem with interval data is that zero has no real meaning: it does not indicate the absence of the quantity. That’s why ratios are not meaningful and some descriptive and inferential statistics can’t be applied.

For example, temperature in degrees Celsius: 20°C is not twice as warm as 10°C.

Ratio Data: Ratio values are also ordered units that have the same difference. They are the same as interval values, except that they have an absolute zero. In other words, zero has a real meaning.

For example, length.

Data types can also be categorized in other ways, for example qualitative vs. quantitative, or cross-sectional vs. panel data.
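As a minimal sketch, the data types above could be stored in R as follows. All vectors here are made-up toy values, not part of the course datasets, and the education levels are only an illustrative ordinal scale.

```r
# Nominal: labels with no order
language <- factor(c("Turkish", "English", "German", "English"))

# Ordinal: labels with a meaningful order (illustrative levels)
education <- factor(c("primary", "tertiary", "secondary", "primary"),
                    levels = c("primary", "secondary", "tertiary"),
                    ordered = TRUE)

# Discrete: counts stored as integers
defective_items <- c(0L, 2L, 1L, 0L)

# Continuous: measurements stored as doubles
height_cm <- c(172.5, 160.2, 181.0, 168.4)

is.ordered(language)          # FALSE: nominal factors have no order
is.ordered(education)         # TRUE
education[2] > education[1]   # TRUE: comparisons work for ordered factors
class(defective_items)        # "integer"
class(height_cm)              # "numeric"
```

Comparing two values of `language` with `>` would only give a warning and NA, which is R's way of saying that the order has no meaning for nominal data.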

Exploratory Data Analysis

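This term is generally abbreviated as EDA. EDA is an iterative cycle. You generate questions about your data, search for answers by visualising, transforming, and modelling the data, and use what you learn to refine your questions or generate new ones.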

EDA is not a formal process with a strict set of rules. More than anything, EDA is a state of mind. During the initial phases of EDA you should feel free to investigate every idea that occurs to you. Some of these ideas will pan out, and some will be dead ends.

EDA is an important part of any data analysis, even if the questions are handed to you on a platter, because you always need to investigate the quality of your data. Data cleaning is just one application of EDA: you ask questions about whether your data meets your expectations or not. To do data cleaning, you’ll need to deploy all the tools of EDA: visualisation, transformation, and modelling.

EDA can be categorized as univariate or multivariate. Univariate means that you investigate one variable at a time. Multivariate means that you deal with two or more variables together; in practice, two variables are usually considered.

Before applying multivariate EDA, perform univariate EDA.

Boston Housing Data

The Boston data frame has 506 rows and 14 columns.

This data frame contains the following columns:

crim: per capita crime rate by town.

zn: proportion of residential land zoned for lots over 25,000 sq.ft.

indus: proportion of non-retail business acres per town.

chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

nox: nitrogen oxides concentration (parts per 10 million).

rm: average number of rooms per dwelling.

age: proportion of owner-occupied units built prior to 1940.

dis: weighted mean of distances to five Boston employment centres.

rad: index of accessibility to radial highways.

tax: full-value property-tax rate per $10,000.

ptratio: pupil-teacher ratio by town.

black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.

lstat: lower status of the population (percent).

medv: median value of owner-occupied homes in $1000s.

Let’s read the dataset.

Be careful! The dataset must be in your working directory.

To find out the current working directory.

getwd()

To set your working directory.

setwd("path/to/your/folder") #placeholder path: use the folder that contains Boston.csv
boston=read.csv("Boston.csv",header=TRUE)

Then, print the first six rows of the dataset.

head(boston)

As stated above, the dataset has 506 rows and 14 columns. However, sometimes such information isn’t given. In that case, the dim() command can be used.

dim(boston)
## [1] 506  15

The additional column is the index column, named X. Let’s remove it.

boston=boston[,-1]
dim(boston)
## [1] 506  14

Then, we need to look at how R read the variables, i.e. what is the class of each variable in the dataset?

str(boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : int  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...

The output shows the class of the variables. At this stage, we need to be careful because if any variable has a wrong class, this will affect the whole analysis negatively.

Let’s look at chas.

class(boston$chas) #class function shows the class of a variable
## [1] "integer"

Let’s recall the definition:

chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

It is a dummy variable, i.e. a categorical variable, so it was read with the wrong class. To correct it,

boston$chas=as.factor(boston$chas)
class(boston$chas)
## [1] "factor"
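A possible refinement, sketched on a toy vector rather than on the real column: factor() can attach readable labels to the 0/1 codes during the conversion. The label names below are my own choice, not part of the dataset.

```r
chas_raw <- c(0L, 0L, 1L, 0L, 1L)   # toy values, not the real chas column

# levels gives the codes found in the data, labels gives the names shown instead
chas_fac <- factor(chas_raw, levels = c(0, 1),
                   labels = c("otherwise", "bounds river"))

class(chas_fac)    # "factor"
levels(chas_fac)   # "otherwise" "bounds river"
table(chas_fac)    # frequencies are now reported under the labels
```

With labels in place, tables and plots show the category names directly, so there is no need to remember what 0 and 1 stand for.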

Summary Statistics

Summary statistics give a quick and simple description of the data, including the mean, median, mode, minimum, maximum, range, standard deviation, etc.

The easiest way to obtain the summary statistics of the variables in the dataset is:

summary(boston)
##       crim                zn             indus       chas   
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   0:471  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1: 35  
##  Median : 0.25651   Median :  0.00   Median : 9.69          
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14          
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10          
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74          
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00

The average nitrogen oxides concentration is 0.555. Its minimum is 0.385, while its maximum is 0.871. Half of the observations have a nitrogen oxides concentration below 0.538, and half above it. 25% of the observations are below 0.449, and 25% are above 0.624. Lastly, the variable might have a right skewed distribution, since the mean is larger than the median and there is a considerable gap between the third quartile and the maximum value.

Also, for 35 of the 506 observations the tract bounds the river, and for the remaining 471 it does not.

If you are interested in a specific variable and want to investigate its summary statistics, use the well-known functions.

mean(boston$nox)
## [1] 0.5546951
median(boston$nox)
## [1] 0.538
min(boston$nox)
## [1] 0.385
max(boston$nox)
## [1] 0.871
var(boston$nox) #variance of the variable
## [1] 0.01342764
sd(boston$nox) #standard deviation of the variable
## [1] 0.1158777
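The range, the quartiles and the mode are mentioned above but not computed. A minimal sketch on toy data follows. Note that base R has no function for the statistical mode (mode() returns the storage type of an object), so stat_mode below is a hypothetical helper written for this example.

```r
x <- c(2, 4, 4, 5, 7, 9, 4, 10)   # toy data

range(x)                                   # minimum and maximum together
diff(range(x))                             # the range as a single number
quantile(x, probs = c(0.25, 0.5, 0.75))    # quartiles
IQR(x)                                     # interquartile range

# Hypothetical helper: returns the most frequent value
# (the first one in sorted order if there are ties)
stat_mode <- function(v) {
  counts <- table(v)
  as.numeric(names(counts)[which.max(counts)])
}
stat_mode(x)   # 4
```

The same calls work on a column of the dataset, e.g. quantile(boston$nox, probs = c(0.25, 0.5, 0.75)).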

Univariate EDA

Categorical Data

Frequency Table

Frequency refers to the number of times an event or a value occurs. A frequency table is a table that lists items and shows the number of times the items occur. It is generally applied on categorical variables.

table(boston$chas) #creates frequency table 
## 
##   0   1 
## 471  35

We see that, out of 506 observations, the tract bounds the Charles River for 35 of them and does not for the remaining 471.
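Relative frequencies are often easier to read than raw counts. As a small sketch, prop.table() turns counts into proportions. The counts below are copied from the frequency table above so that the example runs without the dataset.

```r
chas_counts <- c("0" = 471, "1" = 35)   # counts copied from the table above

# Share of each category in percent, rounded to one decimal place
pct <- round(prop.table(chas_counts) * 100, 1)
pct   # about 93.1% for 0 and 6.9% for 1
```

On the real data, the same result should come from prop.table(table(boston$chas)).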

Bar Plot

A bar chart represents data as rectangular bars, with the length of each bar proportional to the value it represents. It is used to visualize a categorical variable.

To create a bar plot in R, you first need to create the frequency table of the categorical variable.

Research Question

Variable: chas

What is the frequency distribution of chas?

chas=table(boston$chas)
barplot(chas)

bp=barplot(chas,col=c("red","yellow"),main="Bar Plot of Charles River Dummy Variable", ylim=c(0,500), names.arg=c("Otherwise", "tract bounds river"))
text(bp,chas,labels=chas,pos=3)

#col argument fills the bars
#main argument adds a title
#ylim argument sets the limits of the y-axis
#names.arg argument changes the bar names
#barplot returns the x positions of the bar midpoints, stored here in bp
#text function writes the frequencies on the plot; pos=3 places them above the bars
#chas is the name of the frequency table

The bar plot shows the same result as the frequency table: the tract bounds the river for 35 of the 506 observations, and does not for the remaining 471.

Continuous Variable

As a numerical EDA for continuous variables, summary statistics such as the mean, median and standard deviation are considered. How to obtain them in R is shown above.

Histogram

A histogram represents the frequencies of the values of a variable bucketed into ranges. A histogram is similar to a bar chart, but the difference is that it groups the values into continuous ranges. The height of each bar represents the number of values present in that range.

Research Question

Variable: medv (median value of owner-occupied homes)

What is the distribution of median value of owner-occupied homes in $1000s?

hist(boston$medv) #hist function is used to draw histogram

hist(boston$medv,col="red",main="Histogram of Median",xlab="Median Value of Owner-Occupied")

#xlab changes the x axis name. 

Same Histogram with Different Bins

Bin: Each bar in the histogram is called a bin.

hist(boston$medv,col="red",main="Histogram of Median",xlab="Median Value of Owner-Occupied",breaks = 20)

#xlab changes the x axis name. 
#breaks suggests the number of bins in the histogram; R may adjust it to get round break points

It is seen that the median value of owner-occupied homes has a right skewed distribution.
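One detail about breaks is worth a small sketch: when breaks is a single number, hist() treats it only as a suggestion, whereas a vector of break points is used exactly as given. The data below are simulated, not the Boston data.

```r
set.seed(1)
x <- rexp(200, rate = 0.1)   # simulated right skewed data

# A single number is only a suggestion for the number of bins
h_suggested <- hist(x, breaks = 20, plot = FALSE)
length(h_suggested$counts)   # not necessarily 20

# A vector of break points is taken literally: bins of width 5 here
cuts <- seq(0, ceiling(max(x) / 5) * 5, by = 5)
h_exact <- hist(x, breaks = cuts, plot = FALSE)
length(h_exact$counts) == length(cuts) - 1   # TRUE
```

plot = FALSE returns the bin counts without drawing anything, which is handy for checking how the values were grouped.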

Box Plot

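A box plot is based on Tukey’s five-number summary: the minimum, first quartile, median, third quartile and maximum. The box plot can be used for two main purposes: examining the distribution of a single variable, including its outliers, and comparing the distribution of a variable across groups.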

Anatomy of Box Plot

Therefore, it is suitable for both univariate and multivariate EDA.

Research Question

Variable: medv (median value of owner-occupied homes)

What is the distribution of median value of owner-occupied homes in $1000s?

boxplot(boston$medv)

boxplot(boston$medv,main="Box Plot of Median Value of owner-occupied homes in $1000s",xlab="Median Value of Owner-Occupied",col="red")

It is seen that the variable of interest has a right skewed distribution, most probably caused by outliers. The median of the data is slightly above 20.
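The numbers behind a box plot can also be obtained directly. The sketch below uses toy data with one unusually large value. fivenum() returns Tukey's five-number summary and boxplot.stats() returns the statistics that boxplot() draws, including the points flagged as outliers.

```r
x <- c(12, 15, 17, 19, 20, 21, 22, 23, 25, 50)   # toy data with one large value

fivenum(x)   # minimum, lower hinge, median, upper hinge, maximum

stats <- boxplot.stats(x)
stats$stats  # whisker ends, hinges and median used for the drawing
stats$out    # 50: lies more than 1.5 box lengths beyond the upper hinge
```

Here the hinges are 17 and 23, so the upper fence is 23 + 1.5 * 6 = 32, and only the value 50 falls outside it.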

Box Plot on the top of the Histogram

# Draw the boxplot and the histogram 
layout(mat = matrix(c(1,2),2,1, byrow=TRUE),  heights = c(1,8))
par(mar=c(0, 3.1, 1.1, 2.1))
boxplot(boston$medv,main="Box Plot of Median Value of owner-occupied homes in $1000s",xlab="Median Value of Owner-Occupied",col="red",horizontal=TRUE,frame=F)
par(mar=c(1, 3.1, 1.1, 2.1))
hist(boston$medv,col="red",xlab="Median Value of Owner-Occupied",main="")

#xlab changes the x axis name. 

Multivariate EDA

Categorical Variables

To carry out multivariate EDA (MEDA) on categorical variables, we need more than one categorical variable. However, our dataset has only one. So, we will create our own categorical variables, which is also a part of EDA.

I want to apply the transformation to the rm and medv variables.

rm: average number of rooms per dwelling.

medv: median value of owner-occupied homes in $1000s.

boston$rm_dummy=ifelse(boston$rm>mean(boston$rm),1,0)
#if the value is greater than its average, it takes on a value 1. 
boston$medv_dummy=ifelse(boston$medv>mean(boston$medv),1,0)
head(boston)

Contingency Table

Wikipedia Definition

In statistics, a contingency table (also known as a cross tabulation or crosstab) is a type of table in a matrix format that displays the (multivariate) frequency distribution of the variables. They are heavily used in survey research, business intelligence, engineering and scientific research.

You can create a contingency table by using table() in R.

Research Question

Is there a relationship between the average number of rooms per dwelling and income? (Here, and in what follows, medv, the median home value, is used as a proxy for income.)

table(boston$rm_dummy,boston$medv_dummy)
##    
##       0   1
##   0 229  49
##   1  68 160
#rows denote the first input
#columns denote the second input

It can be said that towns whose income is below the average mostly have a number of rooms per dwelling below the average as well (229 of 297). Likewise, towns whose income is above the average mostly have a number of rooms above the average (160 of 209).
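Column proportions make this comparison easier than raw counts. The sketch below rebuilds the table from the counts printed above, so that it runs without the dataset, and uses prop.table() with margin = 2 to make each column sum to 1.

```r
# Counts copied from the contingency table above (rows: rm_dummy, columns: medv_dummy)
counts <- matrix(c(229, 68, 49, 160), nrow = 2,
                 dimnames = list(rm_dummy = c("0", "1"),
                                 medv_dummy = c("0", "1")))

# Share of each rm_dummy level within each medv_dummy level
col_share <- prop.table(counts, margin = 2)
round(col_share, 3)
```

About 77% of the observations with medv_dummy = 0 also have rm_dummy = 0, and about 77% of those with medv_dummy = 1 have rm_dummy = 1. On the real data, prop.table(table(boston$rm_dummy, boston$medv_dummy), margin = 2) should give the same shares.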

Grouped Bar Plot

counts=table(boston$rm_dummy,boston$medv_dummy)
barplot(counts, main="Room Distribution by Income",
  xlab="Median value of Income", col=c("darkblue","red"),
  legend = c("Below the average","Above the average"), names.arg=c("Below the average", "above the average"),beside=TRUE)

It is seen that, for towns where income is below the average, the number of rooms per dwelling is mostly below the average. On the other hand, for towns whose median income is above the average, the number of rooms per dwelling is mostly above the average.

Continuous Variables

Scatter Plot

Scatter plot is a graphical way used to display the relationship between two continuous variables by using dots.

Research Question

Is there any relationship between crime rate and income?

For this case, we have two continuous variables which are crime rate and income. It is a good idea to use scatter plot to see the relationship between these two variables.

plot(boston$medv,boston$crim)

#logic-> plot(independent variable, dependent variable)
plot(boston$medv,boston$crim,
     pch=18, 
     cex=2, 
     col="#69b3a2",
     xlab="Median Income", ylab="Crime Rate",
     main="The Relationship between Income and Crime Rate"
     )

It is seen that the relationship is not linear, but there is a negative relationship between these two variables: when income increases, the crime rate decreases.
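A number can complement what the scatter plot shows. The sketch below uses simulated data with a decreasing but non-linear relationship, not the Boston data. cor() computes the Pearson correlation by default; with method = "spearman" it works on ranks, which suits monotone relationships that are not linear.

```r
set.seed(42)
x <- runif(100, 5, 50)                  # simulated predictor
y <- 100 / x + rnorm(100, sd = 0.2)     # decreasing, non-linear response

cor(x, y)                        # Pearson: measures linear association
cor(x, y, method = "spearman")   # Spearman: measures monotone association
```

Both values are negative here, and the Spearman value is closer to -1 because the relationship is monotone but curved. The same calls could be applied to boston$medv and boston$crim.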

You can customize your scatter plot using the following arguments of the plot function.

cex → shape size
lwd → line width
col → control colors
lty → line type
pch → marker shape
type → link between dots

Continuous Variables and Categorical Variables

Research Question

Is there any relationship between crime rate and income (for dummy one)?

We have one continuous variable, crime rate, and one categorical variable, the income dummy. For such cases, a box plot is a good choice.

boxplot(boston$crim~boston$medv_dummy,main="The Boxplot of Crime Rate wrt Income",col=c("red","yellow"),xlab="Median of Income",ylab="Crime Rate",ylim=c(0,20))
legend("topright", legend=c("0=Below the Average","1= Above the Average"),col=c("red","yellow"),pt.bg=c("red","yellow"),bty = "n", pch=25 , pt.cex = 3, cex = 1, horiz = FALSE, inset = c(0.03, 0.1))

It can be said that, for towns whose income is below the average, the crime rate has a right skewed distribution with outliers. For towns whose income is above the average, the crime rate also has a right skewed distribution with outliers. Also, their crime rate is slightly lower than that of the towns whose income is below the average.
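The comparison read from the box plot can be checked numerically with tapply(), which applies a function to a numeric variable within each level of a grouping variable. The vectors below are toy values, not the Boston data.

```r
crim <- c(0.1, 0.3, 8.5, 12.0, 0.05, 0.2, 4.1, 0.08)   # toy crime rates
medv_dummy <- c(1, 1, 0, 0, 1, 0, 0, 1)                 # toy group labels

# Median crime rate within each group
group_median <- tapply(crim, medv_dummy, median)
group_median

# Full summary statistics within each group
tapply(crim, medv_dummy, summary)
```

On the real data, the call would be tapply(boston$crim, boston$medv_dummy, median).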

Lattice Plot

The lattice package is a powerful and elegant high-level data visualization system, with an emphasis on multivariate data. It generates a plot split into panels, one for each level of a categorical variable.

Research Question

What is the relationship between crime rate and the black variable for each level of income?

In this case, we are interested in the relationship between crime rate and the black variable, 1000(Bk - 0.63)^2, with respect to the income level variable we created.

Numeric variables: crime rate, black.

Categorical(Factor) Variable: Income (Dummy one).

#install.packages("lattice") #run once if the package is not installed
library(lattice)
xyplot(crim ~ black | factor(medv_dummy), data=boston, pch=20, cex=3)

There’s no clear pattern for the case where income is less than average, but the negative relationship can be observed for the case where income is greater than average.

More Examples

Research Question

Does age have normal distribution with respect to level of income?

If you’d like to check normality graphically, you can use a Q-Q plot.

qqmath(~ boston$age | factor(boston$medv_dummy), data = boston,col="orange", f.value = ppoints(100),auto.key = list(columns = 2),type = c("p", "g"), aspect = "xy")

It is seen that age does not follow a normal distribution in either group.

Research Question

What is the distribution of age with respect to level of income?

histogram(~ boston$age | factor(boston$medv_dummy), data = boston,col="orange")

For the case where income is less than its average, age has a left skewed distribution. For the case where income is greater than its average, age shows a bimodal distribution.

You can also check this question by using density plot.

densityplot(~ boston$age | factor(boston$medv_dummy), data = boston,col="orange",  plot.points = FALSE, auto.key = TRUE)

The density plots lead to the same conclusions.

Research Question

Now, I am interested in the relationship between crime rate and age with respect to level of income. However, I don’t want to use a scatter plot here, because scatter plots have rarely been informative for me in such cases. That’s why I will categorize the age variable, as I did for median income and room number, this time using the median as the cut-off.

boston$age_dummy=ifelse(boston$age>median(boston$age),1,0)
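ifelse() gives two groups. When more than two groups are wanted, cut() is a common alternative. The sketch below uses toy ages and cut points chosen only for illustration; it is not part of the analysis that follows.

```r
age <- c(2.9, 30, 45, 60, 77.5, 85, 94, 100)   # toy values

# Three groups; each interval is closed on the right, and
# include.lowest = TRUE keeps a value equal to the first break
age_group <- cut(age, breaks = c(0, 45, 77.5, 100),
                 labels = c("low", "middle", "high"),
                 include.lowest = TRUE)

table(age_group)   # low: 3, middle: 2, high: 3
```

The result is already a factor, so it can be used directly as a grouping variable in lattice plots.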

Box-and-whisker Plot

bwplot(boston$crim~factor(boston$age_dummy)| factor(boston$medv_dummy),data=boston)

The black dot represents the median of each group. The x axis and y axis denote the age level and the crime rate, respectively. We have outliers in most cases, and all cases have right skewed distributions because of them. The crime rate is higher, on average, when age is above its median (77.5) than when it is below.

For more R graphics, please visit (https://www.r-graph-gallery.com/).

Exercise

  1. Please read the bank-full.txt dataset with an appropriate command. The data is related to direct marketing campaigns of a Portuguese banking institution. The variable descriptions are given below.

Input variables:
1 - age
2 - job: type of job
3 - marital: marital status
4 - education
5 - default: has credit in default?
6 - housing: has housing loan?
7 - loan: has personal loan?
8 - contact: contact communication type
9 - month: last contact month of year
10 - day_of_week: last contact day of the week
11 - duration: last contact duration, in seconds (numeric)
12 - campaign: number of contacts performed during this campaign and for this client
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign
14 - previous: number of contacts performed before this campaign and for this client
15 - poutcome: outcome of the previous marketing campaign
16 - y: has the client subscribed a term deposit? (binary: ‘yes’,‘no’)

a. Please check the class of the variables in the dataset. If you detect one with the wrong class, please correct it.

## 'data.frame':    45211 obs. of  17 variables:
##  $ age      : int  58 44 33 47 33 35 28 42 58 43 ...
##  $ job      : Factor w/ 12 levels "admin.","blue-collar",..: 5 10 3 2 12 5 5 3 6 10 ...
##  $ marital  : Factor w/ 3 levels "divorced","married",..: 2 3 2 2 3 2 3 1 2 3 ...
##  $ education: Factor w/ 4 levels "primary","secondary",..: 3 2 2 4 4 3 3 3 1 2 ...
##  $ default  : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 1 1 ...
##  $ balance  : int  2143 29 2 1506 1 231 447 2 121 593 ...
##  $ housing  : Factor w/ 2 levels "no","yes": 2 2 2 2 1 2 2 2 2 2 ...
##  $ loan     : Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ contact  : Factor w/ 3 levels "cellular","telephone",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ day      : int  5 5 5 5 5 5 5 5 5 5 ...
##  $ month    : Factor w/ 12 levels "apr","aug","dec",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ duration : int  261 151 76 92 198 139 217 380 50 55 ...
##  $ campaign : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ pdays    : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ previous : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ poutcome : Factor w/ 4 levels "failure","other",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ y        : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...

Answer: It is seen that there is no variable with wrong class.

b. Get the summary table. Then, interpret one numeric and one categorical variable of your choice.

##       age                 job           marital          education    
##  Min.   :18.00   blue-collar:9732   divorced: 5207   primary  : 6851  
##  1st Qu.:33.00   management :9458   married :27214   secondary:23202  
##  Median :39.00   technician :7597   single  :12790   tertiary :13301  
##  Mean   :40.94   admin.     :5171                    unknown  : 1857  
##  3rd Qu.:48.00   services   :4154                                     
##  Max.   :95.00   retired    :2264                                     
##                  (Other)    :6835                                     
##  default        balance       housing      loan            contact     
##  no :44396   Min.   : -8019   no :20081   no :37967   cellular :29285  
##  yes:  815   1st Qu.:    72   yes:25130   yes: 7244   telephone: 2906  
##              Median :   448                           unknown  :13020  
##              Mean   :  1362                                            
##              3rd Qu.:  1428                                            
##              Max.   :102127                                            
##                                                                        
##       day            month          duration         campaign     
##  Min.   : 1.00   may    :13766   Min.   :   0.0   Min.   : 1.000  
##  1st Qu.: 8.00   jul    : 6895   1st Qu.: 103.0   1st Qu.: 1.000  
##  Median :16.00   aug    : 6247   Median : 180.0   Median : 2.000  
##  Mean   :15.81   jun    : 5341   Mean   : 258.2   Mean   : 2.764  
##  3rd Qu.:21.00   nov    : 3970   3rd Qu.: 319.0   3rd Qu.: 3.000  
##  Max.   :31.00   apr    : 2932   Max.   :4918.0   Max.   :63.000  
##                  (Other): 6060                                    
##      pdays          previous           poutcome       y        
##  Min.   : -1.0   Min.   :  0.0000   failure: 4901   no :39922  
##  1st Qu.: -1.0   1st Qu.:  0.0000   other  : 1840   yes: 5289  
##  Median : -1.0   Median :  0.0000   success: 1511              
##  Mean   : 40.2   Mean   :  0.5803   unknown:36959              
##  3rd Qu.: -1.0   3rd Qu.:  0.0000                              
##  Max.   :871.0   Max.   :275.0000                              
## 

The minimum and maximum ages in the data are 18 and 95, respectively. Half of the customers are younger than 39 and half are older. Also, the average age of the customers is 40.94. Lastly, we could expect a right skewed distribution for age, since there is a large gap between the third quartile and the maximum value.

For a categorical variable, we consider loan. Among 45211 customers, 7244 have a personal loan, and the remaining 37967 do not.

c. Make an inference about the distribution of age.

Note: Please draw a box plot in addition to the histogram.

boxplot(bank$age,col="blue",main="Box Plot of Age",xlab="Age")

We can say that age has a right skewed distribution, as we concluded in the previous part. It also has outlying observations that distort the shape of the distribution.

d. Check whether there is a relationship between marital status and housing loan, and between marital status and personal loan.

It can be said that most of the customers who have a housing loan are married. The same interpretation can be made for the customers who do not have a housing loan.

It can be said that most of the customers who don’t have a personal loan are married. The same interpretation can be made for the customers who have a personal loan.

e. What is the relationship between age and duration with respect to whether the client subscribed to a term deposit?

In both cases, we cannot observe a clear relationship pattern.

You will define three research questions and try to answer them using appropriate tools. If you find something making sense, please inform me.

This note is prepared by Ozancan Ozdemir. (e-mail: , Room No: 234)

You have to complete your quiz by 23.59 on ODTUClass.